Project Problem Statement - Potential Customers Prediction¶
Context¶
The EdTech industry has surged over the past decade; one forecast projected the online education market to reach $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has driven this growth and expansion. With features like ease of information sharing, personalized learning experiences, and transparent assessment, it is now preferable to traditional education for many learners.
The online education sector has witnessed rapid growth and is attracting a lot of new customers. Due to this rapid growth, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach out to a wider audience with their offerings. The customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, like:
- The customer interacts with the marketing front on social media or other online platforms.
- The customer browses the website/app and downloads the brochure.
- The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. For this, a representative from the organization connects with the lead by call or email to share further details.
Objective¶
ExtraaLearn is an initial stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is to identify which of the leads are more likely to convert so that they can allocate the resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
Analyze and build an ML model to help identify which leads are more likely to convert to paid customers. Find the factors driving the lead conversion process. Create a profile of the leads who are likely to convert.
Learning Outcomes¶
EDA (Univariate Analysis, Multivariate Analysis)
Visualization
Data Preprocessing (Log Transformations, Outlier Treatment, Missing Value Treatment, Feature Engineering)
Classification Models (Logistic Regression, Decision Trees, Random Forest)
Model Performance Evaluation and Improvement (Cross-Validation Techniques)
Data Dictionary¶
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
ID: ID of the lead
age: Age of the lead
current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
first_interaction: How did the lead first interact with ExtraaLearn? Values include 'Website' and 'Mobile App'
profile_completed: What percentage of the profile has been filled by the lead on the website/mobile app? Values include Low (0-50%), Medium (50-75%), and High (75-100%)
website_visits: The number of times a lead has visited the website
time_spent_on_website: Total time spent on the website
page_views_per_visit: Average number of pages on the website viewed during the visits
last_activity: Last interaction between the lead and ExtraaLearn
Email Activity: Seeking details about the program through email, Representative shared information with a lead like a brochure of the program, etc.
Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.
print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper
print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine
digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms
educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.
status: Flag indicating whether the lead was converted to a paid customer or not.
Importing the necessary libraries and overview of the dataset¶
# Import warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Algorithms to use
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report, f1_score
from sklearn import metrics
from xgboost import XGBRegressor, XGBClassifier
import multiprocessing
import shap
import xgboost as xgb
Loading the data¶
customer = pd.read_csv("ExtraaLearn.csv")
# Copying data to another variable to avoid any changes to original data
data = customer.copy()
View the first and the last 5 rows of the dataset¶
data.head()
| ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
| 4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
data.tail()
| age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | engagement_score | interaction_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4607 | 35 | Unemployed | Mobile App | Medium | 2.772589 | 5.888878 | 1.153732 | Phone Activity | No | No | No | Yes | No | 0 | 6.794185 | 1.287342 |
| 4608 | 55 | Professional | Mobile App | Medium | 2.197225 | 7.752765 | 1.855204 | Email Activity | No | No | No | No | No | 0 | 14.382958 | 0.769551 |
| 4609 | 58 | Professional | Website | High | 1.098612 | 5.361292 | 1.306168 | Email Activity | No | No | No | No | No | 1 | 7.002750 | 0.476380 |
| 4610 | 57 | Professional | Mobile App | Medium | 0.693147 | 5.043425 | 1.584940 | Website Activity | Yes | No | No | No | No | 0 | 7.993528 | 0.268148 |
| 4611 | 55 | Professional | Website | Medium | 1.609438 | 7.736744 | 1.123305 | Phone Activity | No | No | No | No | No | 0 | 8.690722 | 0.757987 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     4612 non-null   object
 1   age                    4612 non-null   int64
 2   current_occupation     4612 non-null   object
 3   first_interaction      4612 non-null   object
 4   profile_completed      4612 non-null   object
 5   website_visits         4612 non-null   int64
 6   time_spent_on_website  4612 non-null   int64
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object
 9   print_media_type1      4612 non-null   object
 10  print_media_type2      4612 non-null   object
 11  digital_media          4612 non-null   object
 12  educational_channels   4612 non-null   object
 13  referral               4612 non-null   object
 14  status                 4612 non-null   int64
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
The dataset has 4,612 rows and 15 columns.
age, website_visits, time_spent_on_website, status, and page_views_per_visit are numeric, while the rest of the columns are objects. There are no null values in the dataset.
ID is an identifier. Let's check if each entry of the column is unique.
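A quick way to run that check is to compare the number of distinct IDs with the row count; a minimal sketch (the `all_ids_unique` helper is illustrative, not part of the original notebook):

```python
import pandas as pd

def all_ids_unique(df: pd.DataFrame, col: str = "ID") -> bool:
    # nunique() counts distinct values; equality with the row count means no duplicates
    return df[col].nunique() == len(df)

# In the notebook this would be called as: all_ids_unique(data)
```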
Observations:
- We can see that all the entries of this column are unique. Hence, this column would not add any value to our analysis.
- Let's drop this column.
data = data.drop(["ID"], axis = 1)
data.head()
| age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
| 4 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
Exploratory Data Analysis and Data Preprocessing¶
Summary Statistics for numerical columns¶
# Selecting numerical columns and checking the summary statistics
num_cols = data.select_dtypes('number').columns
data[num_cols].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 4612.0 | 46.201214 | 13.161454 | 18.0 | 36.00000 | 51.000 | 57.00000 | 63.000 |
| website_visits | 4612.0 | 3.566782 | 2.829134 | 0.0 | 2.00000 | 3.000 | 5.00000 | 30.000 |
| time_spent_on_website | 4612.0 | 724.011275 | 743.828683 | 0.0 | 148.75000 | 376.000 | 1336.75000 | 2537.000 |
| page_views_per_visit | 4612.0 | 3.026126 | 1.968125 | 0.0 | 2.07775 | 2.792 | 3.75625 | 18.434 |
| status | 4612.0 | 0.298569 | 0.457680 | 0.0 | 0.00000 | 0.000 | 1.00000 | 1.000 |
Observations:
Age:
Mean: 46.2 years; Median: 51 years → Slight left skew driven by younger ages.
Most ages fall between 36 (25th percentile) and 57 (75th percentile).
Range: 18 (youngest) to 63 (oldest).
Website Visits:
Mean: 3.57 visits; Median: 3 visits → Right skew observed.
Range: 0 to 30 visits, with most users visiting 2–5 times (IQR).
Subset of users showed high engagement (maximum of 30 visits).
Time Spent on Website:
Mean: 724 seconds (~12 minutes); Median: 376 seconds (~6 minutes) → Right skew present.
Range: 0 to over 2500 seconds (~40+ minutes), with significant variation in engagement.
Most users spent 148.75 (25th percentile) to 1336.75 (75th percentile) seconds.
Page Views Per Visit:
Mean: 3.03 pages; Median: 2.79 pages → Right skew detected.
Range: 0 to 18.43 pages, with the majority viewing 2–4 pages (IQR).
Status:
- The binary target variable indicates 30% of users belong to category 1.
Checking the distribution and outliers for numerical columns in the data¶
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins = 10, grid = False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    plt.show()
age Skew : -0.72
website_visits Skew : 2.16
time_spent_on_website Skew : 0.95
page_views_per_visit Skew : 1.27
Observations:
Age (-0.72 skew): Most ages are in the higher range; consistent, few outliers.
Website Visits (2.16 skew): Most visits are low, but a few users visit a lot; many outliers.
Time Spent (0.95 skew): Most spend little time, with a moderate tail toward longer times.
Page Views per Visit (1.27 skew): Most view few pages, but some view significantly more; many outliers.
Check the percentage of each category for categorical variables.
# Check categorical variables; status is stored as numeric but is binary, so it is treated as categorical here
cat_cols = ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity', 'print_media_type1',
            'print_media_type2', 'digital_media', 'educational_channels', 'referral', 'status']
for col in cat_cols:
    print(data[col].value_counts(normalize = True))  # normalize=True gives the proportion of each category
    print('*' * 40)
current_occupation
Professional    0.567216
Unemployed      0.312446
Student         0.120338
Name: proportion, dtype: float64
****************************************
first_interaction
Website       0.551171
Mobile App    0.448829
Name: proportion, dtype: float64
****************************************
profile_completed
High      0.490893
Medium    0.485906
Low       0.023200
Name: proportion, dtype: float64
****************************************
last_activity
Email Activity      0.493929
Phone Activity      0.267563
Website Activity    0.238508
Name: proportion, dtype: float64
****************************************
print_media_type1
No     0.892238
Yes    0.107762
Name: proportion, dtype: float64
****************************************
print_media_type2
No     0.94948
Yes    0.05052
Name: proportion, dtype: float64
****************************************
digital_media
No     0.885733
Yes    0.114267
Name: proportion, dtype: float64
****************************************
educational_channels
No     0.847138
Yes    0.152862
Name: proportion, dtype: float64
****************************************
referral
No     0.979835
Yes    0.020165
Name: proportion, dtype: float64
****************************************
status
0    0.701431
1    0.298569
Name: proportion, dtype: float64
****************************************
Observations:
Current Occupation: Most are professionals (56.7%), with smaller proportions of unemployed (31.2%) and students (12.0%).
First Interaction: Slightly more users interacted via the website (55.1%) than the mobile app (44.9%).
Profile Completed: High and medium completion levels are nearly equal (49.1% and 48.6%), with very few low completions (2.3%).
Last Activity: Email activity dominates (49.4%), followed by phone activity (26.8%) and website activity (23.9%).
Print Media: Both types show low engagement, with "No" responses at 89.2% and 94.9% respectively.
Digital Media: Most users do not engage (88.6%), while 11.4% do.
Educational Channels: Minimal use, with "Yes" at 15.3%.
Referral: Rare, with only 2.0% referred.
Status: 29.9% converted
Bivariate analysis¶
# List of columns to plot
columns_to_plot = ['current_occupation', 'first_interaction', 'profile_completed',
'last_activity', 'print_media_type1', 'print_media_type2',
'digital_media', 'educational_channels', 'referral']
# Loop through each column
for col in columns_to_plot:
    # Count plot of each category split by status
    plt.figure(figsize=(6, 3))
    sns.countplot(x=col, hue='status', data=data)
    plt.title(f"Count Plot for {col} by Status", fontsize=12)
    plt.xlabel(col.capitalize(), fontsize=10)
    plt.ylabel("Count", fontsize=10)
    plt.legend(title='Status', fontsize=8)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.show()

    # Stacked bar plot with row-wise percentages
    crosstab = pd.crosstab(data[col], data['status'], normalize='index') * 100
    ax = crosstab.plot(kind='bar', figsize=(8, 4), stacked=True)
    plt.ylabel('Percentage Status %')
    plt.title(f"Stacked Bar Plot for {col} by Status", fontsize=12)
    plt.xlabel(col.capitalize(), fontsize=10)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.legend(title='Status', fontsize=8)

    # Annotate bars with percentages
    for p in ax.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy()
        ax.annotate(f'{height:.1f}%', (x + width / 2, y + height / 2),
                    ha='center', va='center', fontsize=8, color='black')
    plt.show()
Observations:
35% of professionals, 26.6% of unemployed individuals, and 11.7% of students converted to paid customers. Professionals had the largest volume of data available.
For first interactions, 45.6% of users converted via the website, compared to 10.5% via the mobile app.
Regarding profile completion, 41.8% of users with completed profiles converted, compared to 18.9% for medium completion and 7.5% for low completion.
In terms of last activity, the website had the highest conversion rate at 38.5%, followed by e-mail activity at 30.3% and phone activity at 21.3%. E-mail activity had the most extensive data available.
Referrals drove a conversion rate of 67.7%, significantly higher than the 29.1% conversion rate for non-referrals.
# List of numerical columns to loop through
columns_to_plot = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Function to create both boxplots and a pair plot
def create_visualizations(data, columns, x_col):
    # Boxplot of each numerical column split by the target
    for col in columns:
        plt.figure(figsize=(10, 6))
        sns.boxplot(data=data, x=x_col, y=col)
        plt.title(f'Boxplot of {col} by {x_col}')
        plt.xlabel(x_col.capitalize())
        plt.ylabel(col.capitalize())
        plt.show()

    # Pair plot of all numerical columns with the target variable as hue
    sns.pairplot(data, hue=x_col, vars=columns, height=2.5)
    plt.suptitle("Pair Plots for Numerical Features by Status", y=1.02, fontsize=14)
    plt.show()

# Call the function
create_visualizations(data, columns_to_plot, x_col="status")
Observations:
Age and the amount of time spent on the website positively influence conversion rates.
For users who converted, the time spent on the website shows significant variability.
Among users who did not convert, the time spent on the website exhibits a right-skewed distribution with noticeable outliers.
Both page views per visit and the number of website visits demonstrate the presence of outliers and right-skewed distributions.
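The outliers noted above can also be counted explicitly with the standard 1.5 × IQR boxplot rule; a minimal sketch (the `count_iqr_outliers` helper is hypothetical, not from the original notebook):

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series) -> int:
    # Standard boxplot rule: flag points beyond 1.5 * IQR from the quartiles
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# e.g. count_iqr_outliers(data["website_visits"]) in the notebook
```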
Apply log transformation to reduce skewness and add additional features:¶
Based on the above and re-examining the data, I will adjust for skewness:
Time Spent on Website:
Median (376) is much smaller than the mean (724), with a wide range (0 to 2537). This is likely highly right-skewed.
Log transformation is strongly recommended.
Website Visits:
Mean is higher than the median (3.566 vs. 3), and the max value (30) is quite far from the 75th percentile (5), indicating right skewness.
Log transformation would be helpful.
Page Views per Visit:
Mean (3.026) is slightly higher than the median (2.792), and the max (18.434) is far from the 75th percentile (3.75625). This indicates moderate right skewness.
Log transformation could improve the distribution.
Include Features for Classification (Exclude in Regression)¶
- Certain fields are interconnected, so I've combined time_spent_on_website and page_views_per_visit into an engagement score to summarize user activity.
- Additionally, I've introduced an interaction ratio to gauge the intensity of user interactions on the website.
# Log transformation to address the skewness noted above
for col in ['website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    data[col] = np.log1p(data[col])  # log(1 + x) avoids log(0)

# Creating new features
data["engagement_score"] = data["time_spent_on_website"] * data["page_views_per_visit"]
data["interaction_ratio"] = data["website_visits"] / (data["page_views_per_visit"] + 1)  # Prevent division by zero
# Review data after transformation:
# Extend the list of numerical columns to include the new features
columns_to_plot = [
'age',
'website_visits',
'time_spent_on_website',
'page_views_per_visit',
'engagement_score',
'interaction_ratio'
]
# Reuse the create_visualizations function defined above to check whether skewness is reduced
create_visualizations(data, columns_to_plot, x_col="status")
Observations:
The transformations have reduced skewness: Website Visits and Time Spent on Website now show smoother distributions for both status 0 and status 1. Converted users (status 1) still cluster at higher values, emphasizing their engagement.
Conversion rates (status 1) still appear strongly associated with higher values of engagement-related metrics (time_spent_on_website and page_views_per_visit).
Page Views per Visit: After scaling or transformation, the distributions now highlight subtle differences between the groups, with status 1 users exhibiting slightly longer tails in the higher values.
Pairwise correlations between all the variables.
plt.figure(figsize=(10, 7))
# Select only the numeric columns
datanumbers = data.select_dtypes(include='number')
# Plot the heatmap
sns.heatmap(datanumbers.corr(), annot=True, fmt=".2f")
plt.show()
Observations:
This highlights the positive correlation between status and age (0.12) as well as between status and time spent on the website (0.25).
The newly created engagement score also shows positive correlation with status.
Among the variables analyzed, time spent on the website exhibited the strongest correlation, reinforcing its significance in driving outcomes.
Preparing the data for modeling¶
# Separating the target variable and other variables
X = data.drop(columns = 'status')
Y = data['status']
# Creating dummy variables
X = pd.get_dummies(X, drop_first = True)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
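Since roughly 30% of leads convert, one option worth considering is a stratified split, which keeps the class proportions identical in the train and test sets; a variant of the split above, shown here on synthetic data with the same imbalance (not the notebook's actual split):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (~30% positives), mirroring `status`
X_demo = pd.DataFrame({"age": np.arange(100)})
Y_demo = pd.Series([1] * 30 + [0] * 70)

# stratify=Y_demo keeps the 30/70 class ratio identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=1, stratify=Y_demo
)
print(y_tr.mean(), y_te.mean())  # both ~0.30
```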
Create a logistic regression model with scaled features. Later, we'll compare it against classification models like Decision Trees and Random Forest, along with XGBoost.¶
# This excludes the new features engagement score and interaction ratio and uses original features for cleaner interpretation and to avoid multicollinearity
# Function to evaluate metrics and plot confusion matrix
def metrics_score(actual, predicted, title="Confusion Matrix"):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Non-Converter', 'Converter'],
                yticklabels=['Non-Converter', 'Converter'])
    plt.title(title)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
# 1. Define features and target
X = data[['age', 'current_occupation', 'first_interaction', 'profile_completed',
'website_visits', 'time_spent_on_website', 'page_views_per_visit',
'last_activity', 'print_media_type1', 'print_media_type2',
'digital_media', 'educational_channels', 'referral']]
y = data['status']
# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Identify numerical and categorical columns
numerical_columns = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
categorical_columns = ['current_occupation', 'first_interaction', 'profile_completed',
'last_activity', 'print_media_type1', 'print_media_type2',
'digital_media', 'educational_channels', 'referral']
# 4. Preprocessor: Scale numerical columns and encode categorical columns
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_columns),
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
]
)
# 5. Create a pipeline with preprocessing and Logistic Regression
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', LogisticRegression())
])
# 6. Train the model using the pipeline
pipeline.fit(X_train, y_train)
# 7. Make predictions on the test set
y_pred = pipeline.predict(X_test)
# 8. Make predictions on the training set and evaluate
y_pred_train = pipeline.predict(X_train)
print("\n--- Training Set Performance ---")
metrics_score(y_train, y_pred_train, title="Confusion Matrix (Training Set)")
# 9. Evaluate performance on the test set
print("\n--- Test Set Performance ---")
metrics_score(y_test, y_pred)
# 10. Extract Coefficients and Feature Importance
# Get feature names (numerical + one-hot encoded)
feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)
all_feature_names = numerical_columns + list(feature_names)
# 10a. Check for Multicollinearity using VIF
# Preprocessed data for VIF calculation
X_preprocessed = pipeline.named_steps['preprocessor'].transform(X_train)
vif_data = pd.DataFrame()
vif_data["Feature"] = all_feature_names
vif_data["VIF"] = [variance_inflation_factor(X_preprocessed, i) for i in range(X_preprocessed.shape[1])]
print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_data)
# Extract coefficients from the logistic regression model
coefficients = pipeline.named_steps['model'].coef_[0]
# Combine feature names, coefficients, and odds ratios
feature_importance = pd.DataFrame({
'Feature': all_feature_names,
'Coefficient': coefficients,
'Odds Ratio': np.exp(coefficients) # Convert coefficients to odds ratios
}).sort_values(by='Coefficient', ascending=False)
print("\n--- Feature Importance ---")
print(feature_importance)
# 11. Plot Confusion Matrix for the Test Set
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Converter', 'Converter'], yticklabels=['Non-Converter', 'Converter'])
plt.title("Confusion Matrix (Test Set)")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# 12. Plot the ROC Curve and AUC
y_prob = pipeline.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"AUC: {roc_auc_score(y_test, y_prob):.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
--- Training Set Performance ---
precision recall f1-score support
0 0.86 0.92 0.89 2586
1 0.77 0.65 0.71 1103
accuracy 0.84 3689
macro avg 0.82 0.78 0.80 3689
weighted avg 0.83 0.84 0.83 3689
--- Test Set Performance ---
precision recall f1-score support
0 0.86 0.92 0.89 649
1 0.77 0.63 0.69 274
accuracy 0.83 923
macro avg 0.81 0.78 0.79 923
weighted avg 0.83 0.83 0.83 923
/usr/local/lib/python3.11/dist-packages/statsmodels/stats/outliers_influence.py:197: RuntimeWarning: divide by zero encountered in scalar divide vif = 1. / (1. - r_squared_i)
--- Variance Inflation Factor (VIF) ---
Feature VIF
0 age 2.001388
1 website_visits 1.150939
2 time_spent_on_website 1.217417
3 page_views_per_visit 1.131902
4 current_occupation_Professional inf
5 current_occupation_Student inf
6 current_occupation_Unemployed inf
7 first_interaction_Mobile App inf
8 first_interaction_Website inf
9 profile_completed_High inf
10 profile_completed_Low inf
11 profile_completed_Medium inf
12 last_activity_Email Activity inf
13 last_activity_Phone Activity inf
14 last_activity_Website Activity inf
15 print_media_type1_No inf
16 print_media_type1_Yes inf
17 print_media_type2_No inf
18 print_media_type2_Yes inf
19 digital_media_No inf
20 digital_media_Yes inf
21 educational_channels_No inf
22 educational_channels_Yes inf
23 referral_No inf
24 referral_Yes inf
--- Feature Importance ---
Feature Coefficient Odds Ratio
9 profile_completed_High 1.274829 3.578090
8 first_interaction_Website 1.127395 3.087602
2 time_spent_on_website 1.022725 2.780761
24 referral_Yes 0.686616 1.986981
4 current_occupation_Professional 0.652937 1.921174
14 last_activity_Website Activity 0.411688 1.509364
0 age 0.120159 1.127676
6 current_occupation_Unemployed 0.060079 1.061921
18 print_media_type2_Yes 0.037712 1.038432
12 last_activity_Email Activity 0.012840 1.012923
16 print_media_type1_Yes -0.036738 0.963928
20 digital_media_Yes -0.060821 0.940991
21 educational_channels_No -0.127003 0.880731
22 educational_channels_Yes -0.141959 0.867657
1 website_visits -0.149409 0.861217
3 page_views_per_visit -0.150737 0.860074
19 digital_media_No -0.208141 0.812092
15 print_media_type1_No -0.232224 0.792768
17 print_media_type2_No -0.306674 0.735890
11 profile_completed_Medium -0.307595 0.735213
13 last_activity_Phone Activity -0.693491 0.499828
23 referral_No -0.955579 0.384589
5 current_occupation_Student -0.981978 0.374569
10 profile_completed_Low -1.236197 0.290487
7 first_interaction_Mobile App -1.396357 0.247497
Observations:
Performance:
Training accuracy is 84% and test accuracy is 83%, suggesting the model generalizes reasonably well without significant overfitting.
Class 1 recall (65% on training, 63% on test) is much lower than Class 0, indicating the model struggles to correctly identify positive cases (conversions).
Feature Importance:
Top Positive Predictors: profile_completed_High (OR = 3.58): leads who completed their profile are about 3.58 times more likely to convert than the reference group. Other top predictors, first_interaction_Website (OR = 3.09) and time_spent_on_website (OR = 2.78), highlight the importance of user engagement.
Top Negative Predictors: first_interaction_Mobile App (OR = 0.25) and profile_completed_Low (OR = 0.29) are most indicative of users unlikely to convert.
Recommendations:
- Handle Multicollinearity: The OneHotEncoder in the pipeline retains all category levels (no drop='first'), so the resulting dummy columns are perfectly collinear and many features show infinite VIF values. Decision Trees and Random Forests handle multicollinearity without requiring extensive feature engineering. While Lasso Regression or PCA could be applied, the goal here is to use logistic regression only as a baseline for comparison. As a result, the focus will shift to classification models instead.
Building Classification Models¶
Decision Tree¶
# Define a function to plot the confusion matrix and classification report
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    # Row/column 0 corresponds to status 0 (not converted), 1 to converted
    sns.heatmap(cm, annot=True, fmt=".2f",
                xticklabels=["Not Converted", "Converted"],
                yticklabels=["Not Converted", "Converted"])
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.title("Confusion Matrix")
    plt.show()
# Add the additional features
X = data[['age', 'current_occupation', 'first_interaction', 'profile_completed',
'website_visits', 'time_spent_on_website', 'page_views_per_visit',
'engagement_score', 'interaction_ratio',
'last_activity', 'print_media_type1', 'print_media_type2',
'digital_media', 'educational_channels', 'referral']]
# Convert categorical variables to dummy variables, keeping all levels (drop_first=False),
# since tree-based models are not affected by dummy-variable collinearity
X = pd.get_dummies(X, drop_first=False)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state = 7)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=7)
# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)
metrics_score(y_train, y_pred_train1)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)
metrics_score(y_test, y_pred_test1)
precision recall f1-score support
0 0.85 0.86 0.86 962
1 0.68 0.67 0.67 422
accuracy 0.80 1384
macro avg 0.76 0.76 0.76 1384
weighted avg 0.80 0.80 0.80 1384
Observations:
- The Decision Tree model is overfitting the training data, as expected, which reduces its ability to generalize well on the test set. Overfitting explains the perfect scores (e.g., precision, recall, F1-score) on the training set and the performance drop on the test set.
Test set performance shows:
- Overall accuracy is 80%, which is solid but leaves room for improvement.
- The class imbalance (majority class 0 vs. minority class 1) hurts the model's ability to predict the minority class consistently, as the lower metrics for class 1 in the test-set classification report show.
Decision Tree - Hyperparameter Tuning¶
Class Imbalance Handling: Setting class_weight = 'balanced' gives more weight to the minority class (class 1), which is critical for imbalanced datasets.
Hyperparameter Tuning: GridSearchCV tunes parameters like max_depth, criterion, and min_samples_leaf to find the best-performing combination.
Custom Scorer: A custom scorer based on the f1_score for class 1 focuses the search on a metric that balances precision and recall for the minority class; as a startup, ExtraaLearn needs to balance limited sales capacity against maximizing conversions.
Cross-Validation (cv = 5): Cross-validation makes the hyperparameter tuning more robust and guards against overfitting.
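As a quick sanity check on what class_weight = 'balanced' does, the weights it assigns follow w_c = n / (k * n_c); the sketch below reproduces them for a label vector matching the training-set supports (2273 vs. 955):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector reproducing the training-set supports seen in the reports (2273 vs 955)
y_demo = np.array([0] * 2273 + [1] * 955)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], np.round(weights, 3))))  # class 1 is weighted roughly 2.4x class 0
```

The minority class therefore contributes more to the loss per sample, which is what pushes the tuned trees toward higher recall on class 1.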
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = 'balanced')
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=5,
min_samples_leaf=20, random_state=7)
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)
metrics_score(y_train, y_pred_train2)
precision recall f1-score support
0 0.94 0.84 0.89 2273
1 0.70 0.86 0.77 955
accuracy 0.85 3228
macro avg 0.82 0.85 0.83 3228
weighted avg 0.87 0.85 0.85 3228
Observations:
- We can see that performance on the training data has decreased, which is expected since we are no longer overfitting the training dataset.
- The model is still able to identify likely converters, with a recall of 0.86 for class 1 on the training data.
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)
metrics_score(y_test, y_pred_test2)
precision recall f1-score support
0 0.92 0.85 0.89 962
1 0.71 0.84 0.77 422
accuracy 0.85 1384
macro avg 0.82 0.85 0.83 1384
weighted avg 0.86 0.85 0.85 1384
Observations:
- Model Generalization: Consistent performance between training and test sets suggests that hyperparameter tuning reduced overfitting effectively.
- The lack of a significant gap between the training and test results (e.g., F1 scores, precision, recall) confirms that overfitting is effectively controlled. This reinforces the stability and reliability of the model's performance.
- The Decision Tree shows better handling of the minority class (1), as seen in its higher recall and balanced F1-score.
- The Decision Tree has higher accuracy compared to logistic regression.
- The inclusion of additional features did not significantly affect the results, suggesting that the underlying features already capture the most critical predictive information.
We can reduce the depth to 3 and visualize it
tree_model = DecisionTreeClassifier(class_weight = 'balanced', max_depth = 3,
min_samples_leaf = 5, random_state = 7)
# Fit the best algorithm to the data
tree_model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=3, min_samples_leaf=5,
random_state=7)
features = list(X.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(tree_model, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = None)
plt.show()
Observations:
The tree's root splits on first_interaction_Mobile App at a threshold of 0.5; the left branch contains samples whose first interaction was NOT through the mobile app.
Being chosen as the first decision point indicates that first_interaction_Mobile App is the most significant feature for determining the outcome.
The second most important split is on time_spent_on_website.
Interestingly, users whose first interaction is via the mobile app appear less likely to convert. This could point to issues with the mobile app experience, making it less engaging or effective than the website, and it aligns with the insights from the logistic regression model.
# Importance of features in the tree building
print (pd.DataFrame(d_tree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                      Imp
first_interaction_Mobile App     0.319966
time_spent_on_website            0.259234
profile_completed_High           0.245264
current_occupation_Professional  0.052392
current_occupation_Student       0.034057
last_activity_Phone Activity     0.028137
last_activity_Website Activity   0.025097
age                              0.014836
profile_completed_Medium         0.008904
engagement_score                 0.008230
website_visits                   0.003880
first_interaction_Website        0.000000
current_occupation_Unemployed    0.000000
page_views_per_visit             0.000000
interaction_ratio                0.000000
last_activity_Email Activity     0.000000
profile_completed_Low            0.000000
print_media_type1_No             0.000000
print_media_type1_Yes            0.000000
print_media_type2_No             0.000000
print_media_type2_Yes            0.000000
digital_media_No                 0.000000
digital_media_Yes                0.000000
educational_channels_No          0.000000
educational_channels_Yes         0.000000
referral_No                      0.000000
referral_Yes                     0.000000
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
The most important features are first_interaction_Mobile App, time_spent_on_website, and profile_completed_High.
The decision tree suggests that students are less likely to convert than professionals: splits on current_occupation_Student direct samples toward branches with a lower proportion of conversions. This agrees with the earlier bivariate plots and the regression analysis.
Media and referral features contribute minimally. While referral cases exhibit high conversion rates, their overall impact remains limited because of the small proportion of users who come through referrals.
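Split-level claims like these can also be read directly off a fitted tree as text rules via sklearn's export_text. A minimal sketch on toy data, with illustrative feature names standing in for the project's features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the lead features; names below are illustrative only
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=7)
toy_tree = DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=7)
toy_tree.fit(X_demo, y_demo)

# Prints indented if/else rules with thresholds and leaf classes
print(export_text(toy_tree, feature_names=["age", "time_spent", "is_student", "profile_high"]))
```

Running export_text on the project's tree_model with the real feature list would yield the exact rules summarized in the observations above.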
Random Forest Classifier¶
# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train3)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
Observations:
- Similar to the decision tree, the random forest gives a perfect score on the training data.
- The model is most likely overfitting the training dataset, as we observed with the decision tree.
Let's confirm this by checking its performance on the testing data
# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test3)
precision recall f1-score support
0 0.87 0.92 0.89 962
1 0.79 0.69 0.74 422
accuracy 0.85 1384
macro avg 0.83 0.81 0.82 1384
weighted avg 0.85 0.85 0.85 1384
Observations:
Overfitting on the training set: The near-perfect performance on the training set suggests the model captured noise or patterns specific to the training data.
Performance drop on the test set: Metrics fall on unseen data, especially for class 1, confirming that the untuned model struggles to generalize.
Random Forest Classifier - Hyperparameter Tuning¶
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight" : ["balanced",{0: 0.3, 1: 0.7}]
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned_base = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned_base.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
max_depth=7, max_features=0.8, max_samples=0.9,
min_samples_leaf=20, n_estimators=110, random_state=7)
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned_base.predict(X_train)
metrics_score(y_train, y_pred_train4)
precision recall f1-score support
0 0.94 0.86 0.90 2273
1 0.72 0.86 0.78 955
accuracy 0.86 3228
macro avg 0.83 0.86 0.84 3228
weighted avg 0.87 0.86 0.86 3228
# Checking performance on the test data
y_pred_test4 = rf_estimator_tuned_base.predict(X_test)
metrics_score(y_test, y_pred_test4)
precision recall f1-score support
0 0.92 0.87 0.89 962
1 0.73 0.83 0.78 422
accuracy 0.85 1384
macro avg 0.82 0.85 0.83 1384
weighted avg 0.86 0.85 0.86 1384
Observations:
Performance Consistency: Training and test set metrics are very close, with no significant drop in performance, indicating the tuned model is not overfitting.
Hyperparameter Tuning: There is potential to further refine the model by experimenting with additional hyperparameters or adjusting the current hyperparameter values to enhance performance.
Efficiency in Tuning: Acknowledging that GridSearchCV can be computationally intensive, the number of values passed to each hyperparameter has been intentionally reduced to balance runtime with optimization efforts.
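The runtime concern can be quantified: GridSearchCV fits every parameter combination once per CV fold. A quick sketch counting the fits launched by the grid above:

```python
from sklearn.model_selection import ParameterGrid

parameters = {"n_estimators": [110, 120],
              "max_depth": [6, 7],
              "min_samples_leaf": [20, 25],
              "max_features": [0.8, 0.9],
              "max_samples": [0.9, 1],
              "class_weight": ["balanced", {0: 0.3, 1: 0.7}]}

n_combos = len(ParameterGrid(parameters))  # 2^6 = 64 combinations
print(n_combos, "combinations ->", n_combos * 5, "model fits with cv=5")
```

Even with only two values per hyperparameter, the search trains 320 forests, which is why the value lists are kept short.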
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight" : ["balanced",{0: 0.3, 1: 0.7}]
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
max_depth=7, max_features=0.8, max_samples=0.9,
min_samples_leaf=20, n_estimators=110, random_state=7)
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train5)
precision recall f1-score support
0 0.94 0.86 0.90 2273
1 0.72 0.86 0.78 955
accuracy 0.86 3228
macro avg 0.83 0.86 0.84 3228
weighted avg 0.87 0.86 0.86 3228
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test5)
precision recall f1-score support
0 0.92 0.87 0.89 962
1 0.73 0.83 0.78 422
accuracy 0.85 1384
macro avg 0.82 0.85 0.83 1384
weighted avg 0.86 0.85 0.86 1384
Observations:
Performance Summary
Accuracy: 86% on the training set and 85% on the test set, showing good generalization.
Class 0: High precision (94%, 92%) and recall (86%, 87%), with strong F1-scores (90%, 89%).
Class 1: Moderate precision (72%, 73%) but strong recall (86%, 83%), with F1-scores of 78% on both sets.
After hyperparameter tuning, the Random Forest model performs slightly better than the Decision Tree, achieving marginally higher overall accuracy and F1 scores for Class 1. However, the improvement is not substantial. If computational efficiency is a priority, the Decision Tree remains a valid choice due to its comparable performance. Additionally, decision trees are somewhat easier to interpret.
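Beyond class weights and hyperparameters, the precision/recall balance for class 1 can also be shifted by moving the decision threshold away from the default 0.5. A generic sketch on synthetic imbalanced data (not the project's fitted model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~70/30), mirroring the project's class ratio
X_d, y_d = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.3, random_state=1)
clf = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)

# Lowering the threshold flags more leads as likely converters:
# recall for class 1 rises while precision typically falls
proba = clf.predict_proba(X_te)[:, 1]
for thr in (0.5, 0.35):
    pred = (proba >= thr).astype(int)
    print(f"threshold={thr}: precision={precision_score(y_te, pred):.3f}, "
          f"recall={recall_score(y_te, pred):.3f}")
```

For a lead-scoring use case, the threshold could be set to match the sales team's capacity: flag as many leads as representatives can actually call.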
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
- The most important features are time_spent_on_website, first_interaction_Website, and profile_completed_High, followed by profile_completed_Medium. This model places slightly more emphasis on time_spent_on_website relative to the first-interaction feature than the decision tree does, but the top three features remain consistent.
- Many other features, including media-related and referral features, contribute little to nothing in importance.
- While a high number of referrals converted, there were very few referrals compared to non-referrals. It is an effective channel, but one with limited reach.
- The tuned Random Forest had the highest accuracy. The Decision Tree was almost as good, with slightly better interpretability. The models broadly agree with each other.
XGBoost Classifier¶
from xgboost import XGBClassifier
# Choose the type of classifier
xgb_estimator_tuned = XGBClassifier(eval_metric='logloss', random_state=7)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110, 120],
"max_depth": [6, 7],
"learning_rate": [0.1, 0.2],
"subsample": [0.8, 0.9],
"colsample_bytree": [0.8, 0.9],
"scale_pos_weight": [1, 3], # Adjust for class imbalance
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label=1)
# Run the grid search
grid_obj = GridSearchCV(xgb_estimator_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
xgb_estimator_tuned = grid_obj.best_estimator_
# Predictions and Metrics for XGBoost
y_pred_train_xgb = xgb_estimator_tuned.predict(X_train)
y_pred_test_xgb = xgb_estimator_tuned.predict(X_test)
print("XGBoost - Training Results:")
metrics_score(y_train, y_pred_train_xgb)
print("XGBoost - Test Results:")
metrics_score(y_test, y_pred_test_xgb)
XGBoost - Training Results:
precision recall f1-score support
0 0.97 0.98 0.97 2273
1 0.94 0.92 0.93 955
accuracy 0.96 3228
macro avg 0.95 0.95 0.95 3228
weighted avg 0.96 0.96 0.96 3228
XGBoost - Test Results:
precision recall f1-score support
0 0.88 0.92 0.90 962
1 0.79 0.71 0.75 422
accuracy 0.86 1384
macro avg 0.84 0.81 0.82 1384
weighted avg 0.85 0.86 0.85 1384
Observations:
The XGBoost model might be overfitting, as the training results are much better than the test results. We can apply regularization techniques, just as we did for the Decision Tree and Random Forest.
# Choose the type of classifier
xgb_estimator_tuned = XGBClassifier(eval_metric='logloss', random_state=7)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100, 200],
"max_depth": [3, 5, 7],
"learning_rate": [0.01, 0.1, 0.2],
"subsample": [0.8, 0.9, 1.0],
"colsample_bytree": [0.8, 0.9, 1.0],
"scale_pos_weight": [1, 3], # Adjust for class imbalance
"reg_alpha": [0, 0.1, 1], # L1 regularization
"reg_lambda": [1, 1.5, 2] # L2 regularization
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label=1)
# Run the randomized search
import multiprocessing  # needed for cpu_count below
random_search = RandomizedSearchCV(xgb_estimator_tuned, parameters, scoring=scorer, cv=5, n_iter=50, n_jobs=multiprocessing.cpu_count())
random_search = random_search.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
xgb_estimator_tuned = random_search.best_estimator_
# Predictions and Metrics for XGBoost
y_pred_train_xgb = xgb_estimator_tuned.predict(X_train)
y_pred_test_xgb = xgb_estimator_tuned.predict(X_test)
print("XGBoost - Training Results:")
metrics_score(y_train, y_pred_train_xgb)
print("XGBoost - Test Results:")
metrics_score(y_test, y_pred_test_xgb)
XGBoost - Training Results:
precision recall f1-score support
0 0.90 0.94 0.92 2273
1 0.84 0.75 0.80 955
accuracy 0.89 3228
macro avg 0.87 0.85 0.86 3228
weighted avg 0.88 0.89 0.88 3228
XGBoost - Test Results:
precision recall f1-score support
0 0.88 0.93 0.91 962
1 0.82 0.71 0.76 422
accuracy 0.86 1384
macro avg 0.85 0.82 0.83 1384
weighted avg 0.86 0.86 0.86 1384
# SHAP values for feature importance
import shap
explainer = shap.Explainer(xgb_estimator_tuned)
shap_values = explainer(X_train)
# Plot the SHAP summary plot for feature importance
shap.summary_plot(shap_values, X_train)
Observations:
- XGBoost and Random Forest both identify the same top three features as the most important. However, XGBoost places current_occupation_Professional in the fourth position, while Random Forest ranks it lower. Despite this difference, the overall insights from both models remain consistent.
Observations:
- Accuracy: Both models have similar accuracy on the test set (0.85 for Random Forest and 0.86 for XGBoost).
- Class 0 Performance: XGBoost has slightly higher recall and F1-score for Class 0 on the test set.
- Class 1 Performance: Random Forest has higher recall for Class 1 on the test set, while XGBoost has higher precision.
- Overfitting: XGBoost shows a larger gap between training and test performance, indicating potential overfitting. Regularization techniques applied to XGBoost (like reg_alpha and reg_lambda) help mitigate this but might need further tuning.
- Random Forest: Easier to interpret and performs well with balanced precision and recall.
- XGBoost: Slightly better overall performance but may require careful tuning to avoid overfitting.
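For a side-by-side view, the headline test-set metrics can be collected into one table; the numbers below are transcribed from the classification reports above, not recomputed:

```python
import pandas as pd

# Test-set headline metrics copied from the tuned models' classification reports
summary = pd.DataFrame(
    {"accuracy": [0.85, 0.85, 0.86],
     "recall_class_1": [0.84, 0.83, 0.71],
     "f1_class_1": [0.77, 0.78, 0.76]},
    index=["Decision Tree (tuned)", "Random Forest (tuned)", "XGBoost (tuned)"],
)
print(summary)
```

Seen together, the three tuned models are within a point or two of each other on accuracy, with the trees trading some precision for noticeably higher class-1 recall.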
Final Takeaways¶
Model Summary
Each iteration, starting with Logistic Regression and progressing through tuned Decision Trees and Random Forests to tuned XGBoost, demonstrated incremental improvements in overall performance metrics.
While the Decision Tree model is slightly less effective, it still performs well as a standalone option, especially when computational power is a limiting factor.
Logistic Regression excels in metrics for Class 0 but struggles with recall for Class 1, making it less suited for imbalanced datasets. It also suffers from multicollinearity issues, which tree-based models like Decision Trees, Random Forest, and XGBoost handle more effectively.
The models showed consistency in identifying top, medium, and insignificant contributors. Although the ranking of features varied slightly, the results remained largely consistent.
Business Insights
Mobile App Experience: Users who start their journey with the mobile app show lower conversion rates. This indicates a need to improve the app experience, as the website currently provides a more effective pathway for user engagement and conversion.
Digital Media Ads: With no noticeable impact on conversions, the budget allocated to digital ads could be redirected towards enhancing the mobile app experience, which holds greater potential for improving results.
User Engagement and Profile Completion: High and medium profile completion rates are critical factors for conversions. Marketing strategies should target users with at least medium-level profile completion, and personalized reminder emails can be sent to encourage sign-ups.
Website Optimization: Website-related factors consistently rank as the most influential drivers of conversion. Enhancing website functionality and engagement features should remain a top priority, focusing on users who are highly active on the site.
Referrals: Although referrals were not identified as significant by the models due to the limited number of cases, their high conversion rates are evident in bivariate analysis. Introducing referral bonus programs could increase the volume of referrals, amplifying their impact in future models.
Student Segment: Students are the least likely to enroll, possibly due to existing school options that reduce the need for EdTech solutions. Introducing tiered pricing could make the offerings more appealing and increase demand within this segment.